# Section 1: Getting the Data with spotifyr and Genius
After getting your Spotify client ID and secret from Spotify (link here), you can get started with spotifyr, which now has to be installed from GitHub.
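For reference, here is a minimal setup sketch. It assumes the package is still hosted in the charlie86/spotifyr repository on GitHub and that the remotes package is installed; the credential strings are placeholders to swap for your own values.

```r
# Install spotifyr from GitHub (assumes the remotes package is already installed)
remotes::install_github("charlie86/spotifyr")

library(spotifyr)

# Placeholder credentials: replace with the values from your Spotify developer dashboard
Sys.setenv(SPOTIFY_CLIENT_ID = "your-client-id")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your-client-secret")

# Request an access token; by default it reads the two environment variables set above
access_token <- get_spotify_access_token()
```

Keeping the credentials in environment variables means they never have to appear in the analysis script itself.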
This sentiment analysis focuses on Kanye and his lyricism preceding the JESUS IS KING album. The first thing I do is filter out the albums on which he only features. The exception is Watch the Throne, where he is integral enough to include, although I later filter that album down to only the songs on which he features.
library(spotifyr)
library(dplyr)
# Pull audio features for every Kanye West track on Spotify
kanye <- get_artist_audio_features('kanye west')
kanyealbums <- unique(kanye$album_name)
# Drop the excluded albums by position (the indices depend on the data returned at the time)
kanyealbums <- kanyealbums[-c(11,12,6,1,2)]
kanye <- kanye %>% dplyr::filter(album_name %in% kanyealbums)

Next I obtain the data from Genius, which is tricky because just searching for Kanye returns nearly 3000 songs on which he has featured (many of which are duplicates).
library(httr)
library(purrr)
# `token` must already hold your Genius API access token
genius_get_artists <- function(artist_name, n_results = 10) {
  baseURL <- 'https://api.genius.com/search?q='
  requestURL <- paste0(baseURL, gsub(' ', '%20', artist_name),
                       '&per_page=', n_results,
                       '&access_token=', token)
  res <- GET(requestURL) %>% content %>% .$response %>% .$hits
  # Collect the primary artist of every search hit, then drop duplicates
  map_df(seq_along(res), function(x) {
    tmp <- res[[x]]$result$primary_artist
    list(
      artist_id = tmp$id,
      artist_name = tmp$name
    )
  }) %>% unique
}
genius_artists <- genius_get_artists('kanye west')

Next I get the track URLs with a while loop. It is not especially well written; this was before I discovered purrr.
baseURL <- 'https://api.genius.com/artists/'
requestURL <- paste0(baseURL, genius_artists$artist_id[1], '/songs')
track_lyric_urls <- list()
i <- 1
# Request pages of 50 songs until the API stops returning a next_page value
while (i > 0) {
  tmp <- GET(requestURL, query = list(access_token = token, per_page = 50, page = i)) %>% content %>% .$response
  track_lyric_urls <- c(track_lyric_urls, tmp$songs)
  if (!is.null(tmp$next_page)) {
    i <- tmp$next_page
  } else {
    break
  }
}

Now we keep only the tracks on which Kanye is the primary artist, and then do some really dirty cleaning, since Genius uses a different capitalization format and a different type of apostrophe than R itself uses.
# Return a one-row data frame for tracks where Kanye is the primary artist.
# Other tracks return NULL, which map_df drops when binding the rows together.
select_genius_tracks <- function(track) {
  if (track$primary_artist$name == "Kanye West") {
    data.frame(filtered_track_lyric_titles = track$title,
               filtered_track_lyric_urls = track$url)
  }
}
test <- purrr::map_df(track_lyric_urls, select_genius_tracks)
test$filtered_track_lyric_titles <- as.character(test$filtered_track_lyric_titles)
test <- test %>% distinct(.keep_all = TRUE) # No effect in practice: the rows are already all distinct
test$filtered_track_lyric_titles[702] <- "Through The Wire"
test$filtered_track_lyric_titles[763] <- "We Don't Care"
test$filtered_track_lyric_titles[337] <- "I'll Fly Away"
test$filtered_track_lyric_titles[628] <- "Slow Jamz"
test$filtered_track_lyric_titles[596] <- "School Spirit Skit 1"
test$filtered_track_lyric_titles[597] <- "School Spirit Skit 2"
test$filtered_track_lyric_titles[417] <- "Lil Jimmy Skit"
test$filtered_track_lyric_titles[617] <- "Skit #1 (Kanye West/Late Registration)"
test$filtered_track_lyric_titles[618] <- "Skit #2 (Kanye West/Late Registration)"
test$filtered_track_lyric_titles[619] <- "Skit #3 (Kanye West/Late Registration)"
test$filtered_track_lyric_titles[620] <- "Skit #4 (Kanye West/Late Registration)"
test$filtered_track_lyric_titles[721] <- "Touch The Sky"
test$filtered_track_lyric_titles[291] <- "Heard 'Em Say"
test$filtered_track_lyric_titles[159] <- "Diamonds From Sierra Leone - Remix"
test$filtered_track_lyric_titles[156] <- "Diamonds From Sierra Leone - Bonus Track"
test$filtered_track_lyric_titles[111] <- "Can't Tell Me Nothing"
test$filtered_track_lyric_titles[43] <- "All Of The Lights"
test$filtered_track_lyric_titles[44] <- "All Of The Lights (Interlude)"
test$filtered_track_lyric_titles[152] <- "Devil In A New Dress"
test$filtered_track_lyric_titles[301] <- "Hell Of A Life"
test$filtered_track_lyric_titles[427] <- "Lost In The World"
test$filtered_track_lyric_titles[782] <- "Who Will Survive In America "
test$filtered_track_lyric_titles[599] <- "See You In My Nightmares"
test$filtered_track_lyric_titles[91] <- "Blood On The Leaves"
test$filtered_track_lyric_titles[328] <- "I Am A God"
test$filtered_track_lyric_titles[344] <- "I'm In It"
test$filtered_track_lyric_titles[215] <- "Father Stretch My Hands Pt. 1"
test$filtered_track_lyric_titles[230] <- "Frank's Track"
test$filtered_track_lyric_titles[481] <- "No More Parties In LA"
test$filtered_track_lyric_titles[790] <- "Wouldn't Leave"

Finally, we join the data.
kanye_lyric_titles <- test$filtered_track_lyric_titles %>% str_to_title() %>% as_tibble() %>%
  right_join(kanye, by = c("value" = "track_name")) %>% distinct(value, .keep_all = T)
kanye_lyrics <- left_join(kanye_lyric_titles, test, by = c("value" = "filtered_track_lyric_titles")) %>% distinct(value, .keep_all = T) %>%
  dplyr::rename(track_name = value) %>% arrange(album_release_date) %>% relocate(filtered_track_lyric_urls, .after = track_name) %>%
  drop_na(filtered_track_lyric_urls)
kanye_lyrics$filtered_track_lyric_urls <- as.character(kanye_lyrics$filtered_track_lyric_urls)

Now we can easily scrape the lyrics from Genius’ site using rvest.
library(rvest)
# Web scraping lyrics using rvest, after making an NA column to hold them
kanye_lyrics$filtered_track_lyric_urls <- as.character(kanye_lyrics$filtered_track_lyric_urls)
kanye_lyrics$lyric_text <- rep(NA, nrow(kanye_lyrics))
# Function to scrape lyrics
scrape <- function(x) {
  read_html(x) %>%
    html_nodes(".lyrics p") %>%
    html_text()
}
# Scrape lyrics based on the Genius URL
kanye_lyrics$lyric_text <- purrr::map(kanye_lyrics$filtered_track_lyric_urls, scrape) # This is around when I discovered purrr
kanye_lyrics$lyric_text <- as.character(kanye_lyrics$lyric_text)

Data Cleaning
If you’ve ever been on Genius’ site before you know there’s a lot of extraneous text which needs to be cleaned up.
kanye_lyrics <- kanye_lyrics %>%
  mutate(lyric_text = gsub("([a-z])([A-Z])", "\\1 \\2", lyric_text)) %>% # split words fused together, e.g. "meYou"
  mutate(lyric_text = gsub("\n", " ", lyric_text)) %>% # turn newlines into spaces
  mutate(lyric_text = gsub("\\[.*?\\]", " ", lyric_text)) %>% # drop bracketed tags such as [Chorus]
  mutate(lyric_text = gsub(" {2,}", " ", lyric_text)) # collapse runs of spaces
kanye_lyrics_csv <- kanye_lyrics %>% select(track_name, lyric_text)
# readr::write_csv(kanye_lyrics_csv, "/Users/dunk/Github/SpotifyStats/KanyeFinalAnalysis/kanye-lyrics.csv") # Csv write of just track and lyrics

Separating out the Genius data, then joining the dataframes
genius_data <- data.frame(track_name = kanye_lyrics$track_name, lyrics = kanye_lyrics$lyric_text)
genius_data$track_name <- as.character(genius_data$track_name)
genius_data$lyrics <- as.character(genius_data$lyrics)
spotify_genius <- full_join(genius_data, kanye, by = "track_name") %>%
  distinct(track_name, .keep_all = T) %>%
  drop_na(lyrics) %>%
  relocate(album_name, .after = track_name)

Ordering the albums
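The ordering step can be sketched as follows. This is an illustration rather than the original code: `toy` is a small stand-in for `spotify_genius`, and it assumes `album_release_date` holds full dates (Spotify sometimes returns only a year, which would need separate handling).

```r
# Toy data standing in for spotify_genius (illustrative rows only)
toy <- data.frame(
  album_name = c("Graduation", "The College Dropout", "Graduation", "Late Registration"),
  album_release_date = c("2007-09-11", "2004-02-10", "2007-09-11", "2005-08-30"),
  stringsAsFactors = FALSE
)

# Album names sorted by release date, one entry per album
album_order <- unique(toy$album_name[order(as.Date(toy$album_release_date))])

# Make album_name a factor whose levels follow release order,
# so plots and tables list the albums chronologically instead of alphabetically
toy$album_name <- factor(toy$album_name, levels = album_order)
levels(toy$album_name)
```

Storing the order in the factor levels means every later plot or table picks it up without re-sorting.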
Plotting the Data
First, let’s take a look at valence:
A couple tables here for numerical effect.
Let’s check out all the songs organized by valence
Now a plot of valence, danceability, and energy, vs. album year.
Sonic Score Table.
Another table.
Sentiment Analysis
# tokenized and cleaned datasets of lyrics for textual analysis
tidy_kanye <- spotify_genius %>% unnest_tokens(word, lyrics)
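# Standard stop words plus filler words that are common in lyrics
custom_stop_words <- tibble::tibble(word = c(stop_words$word, "uh", "yeah", "hey", "baby", "ooh", "wanna", "gonna", "ah", "ahh", "ha", "la", "mmm", "whoa", "haa"))
tidier_kanye <- tidy_kanye %>% anti_join(custom_stop_words, by = "word")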
tidier_kanye$word[tidier_kanye$word == "don" | tidier_kanye$word == "didn"] <- NA
tidier_kanye$word[tidier_kanye$word == "ain"] <- NA
tidier_kanye$word[tidier_kanye$word == "isn"] <- NA
tidier_kanye$word[tidier_kanye$word == "usin"] <- "using"
tidier_kanye$word[tidier_kanye$word == "wouldn"] <- "wouldn't"
tidier_kanye$word[tidier_kanye$word == "couldn"] <- "couldn't"
tidier_kanye$word[tidier_kanye$word == "shouldn"] <- "shouldn't"
tidier_kanye$word[tidier_kanye$word == "won"] <- "won't"
tidier_kanye$word[tidier_kanye$word == "ve" | tidier_kanye$word == "ll"] <- NA
tidier_kanye$word[tidier_kanye$word == "ileft"] <- "left"

Wordcloud.
How many tracks does the word “kanye” appear in, and how often? Obviously a lot in “I Love Kanye”, but surprisingly only once in “Wake Up Mr. West”, and not that much overall. I would not be surprised if most artists had a similar number of self-references throughout their discographies.
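A count like this can be sketched in base R. The data frame below is a toy stand-in for `tidier_kanye` with made-up rows, and `target` is a hypothetical helper variable.

```r
# Toy tokenized data standing in for tidier_kanye (illustrative rows only)
toy_tokens <- data.frame(
  track_name = c("Track A", "Track A", "Track A", "Track B", "Track B", "Track C"),
  word = c("kanye", "love", "kanye", "kanye", NA, "wake"),
  stringsAsFactors = FALSE
)

# Keep only rows whose token is the target word; which() skips NA tokens
target <- "kanye"
hits <- toy_tokens[which(toy_tokens$word == target), ]

# Occurrences per track, most frequent first
per_track <- sort(table(hits$track_name), decreasing = TRUE)

n_tracks <- length(per_track)  # number of tracks containing the word
n_total <- sum(per_track)      # total occurrences across all tracks
```

Using `which()` matters here because some tokens were set to `NA` during cleaning, and a bare logical comparison would carry those rows along.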
Wordcloud for just the album ye. You can see why sentiment analysis would pick this up as deeply depressed, angry, or sad.
Wordcloud for 808s & Heartbreak. The issue is clear once again, although this time it could simply be because Kanye says “amazing” a lot in the song “Amazing”. In fact, he says it 55 times, which is the third most frequent word by song in his entire discography. It comes just after “bam”, which isn’t sung by him, and “ey”, both in “Famous”, so it can be argued that it is the most frequent word he personally uses in any one song.
Wordcloud for My Beautiful Dark Twisted Fantasy. There is not much to interpret here because a lot is going on, but if you’ve heard this album, that makes sense.
Lexical diversity is fairly consistently in the middle ranges, although there are some outliers in The Life of Pablo and The College Dropout that should be investigated.
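One common way to measure lexical diversity is the type-token ratio: unique words divided by total words. The sketch below assumes that measure, which may differ from the one behind the plot, and uses toy tokens with a hypothetical helper `type_token_ratio`.

```r
# Toy tokenized data (illustrative rows only)
toy_tokens <- data.frame(
  track_name = c("Track A", "Track A", "Track A", "Track A", "Track B", "Track B"),
  word = c("amazing", "amazing", "so", "amazing", "new", "slaves"),
  stringsAsFactors = FALSE
)

# Type-token ratio: number of distinct words divided by number of words
type_token_ratio <- function(words) {
  words <- words[!is.na(words)]  # ignore tokens blanked out during cleaning
  length(unique(words)) / length(words)
}

# One ratio per track; values near 1 mean little repetition
lex_div <- tapply(toy_tokens$word, toy_tokens$track_name, type_token_ratio)
```

A track that repeats one word many times scores low, so a hook-heavy song will stand out as an outlier under this measure.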
Now for the real fun: sentiment analysis using the NRC, AFINN, and Bing lexicons, which is surprisingly easy.
# joining the tokenized, tidied lyric dataset with sentiment lexicons
kanye_nrc_sub <- tidier_kanye %>%
  inner_join(get_sentiments("nrc")) %>%
  dplyr::filter(!sentiment %in% c("positive", "negative"))
kanye_AFINN <- tidier_kanye %>%
  inner_join(get_sentiments("afinn"))
kanye_bing <- tidier_kanye %>%
  inner_join(get_sentiments("bing"))

Sentiment scores with AFINN. It looks like it's not a great tool in general, given that it got the exact opposite of what it should have: 808s should come out as his most depressed and lowest point. In the future it would be useful to weight instrumental components in this analysis, to give the data some context.
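The per-album AFINN score can be sketched in base R as below. Averaging the word values per album is an assumption, since the exact aggregation used is not shown, and both the tokens and the lexicon values are made up for illustration (the real AFINN lexicon scores words from -5 to 5).

```r
# Toy tokens standing in for tidier_kanye (illustrative rows only)
toy_tokens <- data.frame(
  album_name = c("Album A", "Album A", "Album A", "Album B", "Album B"),
  word = c("love", "amazing", "lost", "hate", "alone"),
  stringsAsFactors = FALSE
)

# Toy lexicon in the same shape as get_sentiments("afinn"): word + value (values are made up)
toy_lexicon <- data.frame(
  word = c("love", "amazing", "hate", "alone"),
  value = c(3, 4, -3, -2),
  stringsAsFactors = FALSE
)

# Inner join: words missing from the lexicon ("lost" here) drop out
scored <- merge(toy_tokens, toy_lexicon, by = "word")

# Mean sentiment value per album
album_scores <- aggregate(value ~ album_name, data = scored, FUN = mean)
```

Because the join silently drops every word the lexicon does not know, an album's score rests on a small subset of its lyrics, which may be part of why the result looks off.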
Pyramid Plot of The Life of Pablo
Radar chart.
Most common words in song.